| City | State | Population | Central_Station_Coordinates | Air_Stations | Air_Station_Count |
|---|---|---|---|---|---|
| Melbourne | Victoria | 4929201 | -37.814, 144.963 | Moonee Ponds (Mason), Footscray, Kingsville, Brooklyn, Spotswood, Altona North, Melbourne CBD, Alphington | 8 |
| Sydney | New South Wales | 4892217 | -33.883, 151.206 | Anzac Memorial, Luna Lewisham, Sydney, Australia | 3 |
| Brisbane | Queensland | 2545882 | -27.467, 153.017 | Brisbane CBD, South Brisbane, Woolloongabba, Cannon Hill, Rocklea | 5 |
| Perth | Western Australia | 2205223 | -31.950, 115.861 | Subiaco | 1 |
| Adelaide | South Australia | 1399088 | -34.928, 138.601 | CBD West | 1 |

Introduction
Air pollution is one of the most serious issues worldwide because of its adverse effects on human health and the environment. Monitoring air quality is crucial, and it typically involves measuring the concentrations of the following common pollutants, which pose a risk to human health:
PM2.5: Particulate matter with an aerodynamic diameter of less than 2.5 μm. These very fine particles can lodge deep in the lungs and sometimes enter the bloodstream, contributing to respiratory and cardiovascular diseases.
PM10: Particles with a diameter of 10 micrometers or less, which are irritating and harmful to humans. PM2.5 is the finer fraction of this range. PM10 particles can be inhaled and lead to respiratory ailments.
SO2 (Sulfur Dioxide): A gas produced by burning coal, oil and other sulfur-containing fuels. High concentrations of SO2 affect the respiratory system and contribute to acid rain.
O3 (Ozone): Ground-level ozone is one of the most dangerous air pollutants. It is formed by chemical reactions between nitrogen oxides (NOx) and volatile organic compounds (VOCs) in sunlight. Ozone causes respiratory problems and worsens asthma.
NO2 (Nitrogen Dioxide): A gas emitted mainly by vehicle exhaust and industrial sources. NO2 inflames the respiratory tract, worsens respiratory diseases and contributes to the formation of ozone.
CO (Carbon Monoxide): A colourless, odourless, toxic gas produced by the incomplete combustion of hydrocarbons and other carbon-containing materials. High levels of carbon monoxide reduce the blood's ability to carry oxygen, which leads to health complications.
This report illustrates the diurnal and spatial distribution of these pollutants and compares their concentrations between the selected cities.
To answer the question “Which Australian city has the cleanest air?”, we first needed to determine which cities to consider and how to define the boundaries of each city (i.e., which sensors to include). To keep the analysis focused, we chose the five most populous Australian cities.
To ensure consistent and fair comparisons across cities, we defined each city's boundary as a 10 km radius from its central station. All air quality sensors located within this radius were included in the analysis, as listed in the Air_Stations column of the city summary table. The sensors in each city, and the pollutants each one captures, are listed below.
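The 10 km radius rule can be sketched in base R with a haversine distance. This is an illustrative sketch, not the report's actual code: the function name haversine_km, the mean Earth radius of 6371 km, and the two sensor coordinates are assumptions made for the example. Only the Melbourne central station coordinates come from the summary table.

```r
# Great-circle (haversine) distance in km between points given in degrees.
# The default radius of 6371 km is an assumed mean Earth radius.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Central station for Melbourne, taken from the summary table.
centre <- c(lat = -37.814, lon = 144.963)

# Hypothetical sensor coordinates, for illustration only.
sensors <- data.frame(
  location = c("sensor_near", "sensor_far"),
  lat = c(-37.800, -38.150),
  long = c(144.900, 144.360)
)

# Distance of every sensor from the central station.
sensors$dist_km <- haversine_km(
  centre[["lat"]], centre[["lon"]], sensors$lat, sensors$long
)

# Keep only sensors inside the 10 km radius.
within_radius <- sensors[sensors$dist_km <= 10, ]
```

The geosphere package in the reference list provides comparable distance functions; base R is used here only so that the sketch runs without extra packages.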
Melbourne
| location | City | pm10 | pm25 | co | no2 | o3 | so2 |
|---|---|---|---|---|---|---|---|
| Alphington | Melbourne | Yes | Yes | Yes | Yes | Yes | Yes |
| Altona North | Melbourne | No | Yes | No | Yes | No | Yes |
| Brooklyn | Melbourne | Yes | Yes | No | No | No | No |
| Footscray | Melbourne | Yes | Yes | No | No | No | No |
| Kingsville | Melbourne | Yes | Yes | No | No | No | No |
| Melbourne CBD | Melbourne | No | Yes | No | No | No | No |
| Spotswood | Melbourne | No | Yes | No | No | No | No |
Perth
| location | City | pm10 | pm25 | co | no2 | o3 | so2 |
|---|---|---|---|---|---|---|---|
| Subiaco | Perth | Yes | Yes | No | No | No | No |
Sydney
| location | City | pm10 | pm25 | co | no2 | o3 | so2 |
|---|---|---|---|---|---|---|---|
| Anzac Memorial | Sydney | Yes | Yes | No | No | No | No |
| Luna Lewisham | Sydney | Yes | Yes | No | No | No | No |
| Sydney, Australia | Sydney | Yes | Yes | No | No | No | No |
Brisbane
| location | City | pm10 | pm25 | co | no2 | o3 | so2 |
|---|---|---|---|---|---|---|---|
| Brisbane CBD | Brisbane | Yes | Yes | No | No | No | No |
| Cannon Hill | Brisbane | Yes | Yes | No | No | Yes | No |
| Rocklea | Brisbane | Yes | Yes | No | No | Yes | No |
| South Brisbane | Brisbane | Yes | Yes | No | No | No | No |
| Woolloongabba | Brisbane | Yes | Yes | No | No | No | No |
Adelaide
| location | City | pm10 | pm25 | co | no2 | o3 | so2 |
|---|---|---|---|---|---|---|---|
| CBD West | Adelaide | Yes | Yes | No | No | No | No |
Data description
Data Collection:
The data was collected using the airpurifyr package, which retrieves air quality measurements from the OpenAQ API, an open-source platform that aggregates air quality data from government and research organizations worldwide. Data is collected via sensors located in various cities and locations in Australia. The get_measurements_for_location() function is used to pull data based on city, location, and time range.
Variables:
- location_id: Identifier for the location of the sensor.
- location: Name of the sensor’s location.
- parameter: Type of air quality measurement (e.g., PM2.5, NO2).
- value: The pollutant concentration.
- date_utc: Timestamp when the measurement was recorded.
- unit: Measurement unit (typically µg/m³).
- lat: Latitude of the sensor.
- long: Longitude of the sensor.
- country: Country code (e.g., AU for Australia).
Initial data analysis & Exploratory data analysis
Checking the data type
Below is a glimpse at the dataset.
tibble [44,522 × 14] (S3: tbl_df/tbl/data.frame)
$ location_id : int [1:44522] 5521 5521 5521 5521 5521 5521 5521 5521 5521 5521 ...
$ location : chr [1:44522] "Rocklea" "Rocklea" "Rocklea" "Rocklea" ...
$ parameter : chr [1:44522] "pm25" "pm25" "pm25" "pm25" ...
$ value : num [1:44522] 20.6 20.7 20.7 20.4 20.4 20 19.5 19.2 18.9 18.5 ...
$ date_utc : POSIXct[1:44522], format: "2024-08-31 00:00:00" ...
$ unit : chr [1:44522] "µg/m³" "µg/m³" "µg/m³" "µg/m³" ...
$ lat : num [1:44522] -27.5 -27.5 -27.5 -27.5 -27.5 ...
$ long : num [1:44522] 153 153 153 153 153 ...
$ country : chr [1:44522] "AU" "AU" "AU" "AU" ...
$ City : chr [1:44522] "Brisbane" "Brisbane" "Brisbane" "Brisbane" ...
$ State : chr [1:44522] "Queensland" "Queensland" "Queensland" "Queensland" ...
$ Population : num [1:44522] 2545882 2545882 2545882 2545882 2545882 ...
$ Central_Station_Coordinates: chr [1:44522] "-27.467, 153.017" "-27.467, 153.017" "-27.467, 153.017" "-27.467, 153.017" ...
$ Air_Station_Count : num [1:44522] 5 5 5 5 5 5 5 5 5 5 ...
After examining the structure of the dataset and the data types of its variables, we can infer that the data types are appropriately assigned based on the nature of each variable. For example, timestamps are stored as POSIXct for time-based analysis, and pollutant values are stored as numeric.
Aggregate by hours:
The first six rows of the dataset are shown below.
# A tibble: 6 × 14
location_id location parameter value date_utc
<int> <chr> <chr> <dbl> <dttm>
1 5521 Rocklea pm25 20.6 2024-08-31 00:00:00
2 5521 Rocklea pm25 20.7 2024-08-30 23:00:00
3 5521 Rocklea pm25 20.7 2024-08-30 22:00:00
4 5521 Rocklea pm25 20.4 2024-08-30 21:00:00
5 5521 Rocklea pm25 20.4 2024-08-30 20:00:00
6 5521 Rocklea pm25 20 2024-08-30 19:00:00
# ℹ 9 more variables: unit <chr>, lat <dbl>, long <dbl>,
# country <chr>, City <chr>, State <chr>,
# Population <dbl>, Central_Station_Coordinates <chr>,
# Air_Station_Count <dbl>
The values are recorded hourly, so the data is already aggregated by hour. However, some hours have no record. For further analysis, we add a row for each missing hour, with the value set to NA.
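One way to add the missing hours is to build a full hourly grid and left-join the observations onto it. The sketch below uses base R and a toy data frame; the readings and the choice of merge() are assumptions for illustration, not the report's actual code.

```r
# Toy data: one sensor and one parameter, with the 02:00 reading missing.
obs <- data.frame(
  location = "Rocklea",
  parameter = "pm25",
  date_utc = as.POSIXct(
    c("2024-08-30 00:00:00", "2024-08-30 01:00:00", "2024-08-30 03:00:00"),
    tz = "UTC"
  ),
  value = c(20.0, 20.4, 20.7)
)

# Full hourly sequence between the first and last timestamps.
all_hours <- seq(min(obs$date_utc), max(obs$date_utc), by = "hour")

# Cross every location/parameter pair with every hour (by = NULL gives
# the Cartesian product).
pairs <- unique(obs[, c("location", "parameter")])
grid <- merge(pairs, data.frame(date_utc = all_hours), by = NULL)

# Left join: hours without a reading get value = NA.
completed <- merge(
  grid, obs,
  by = c("location", "parameter", "date_utc"),
  all.x = TRUE
)
completed <- completed[
  order(completed$location, completed$parameter, completed$date_utc),
]
```

In a tidyverse workflow, tidyr::complete() performs a similar expansion.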
Next, we check the total number of timestamp records for each parameter at each location. The data covers 60 days, so a complete series has 1440 hourly records (60 × 24) for each parameter at each location. The table below shows how many records exist for each parameter and location; records_proportion is the percentage of the expected 1440 records.
| parameter | location | timestamp_count | records_proportion |
|---|---|---|---|
| co | Alphington | 1341 | 93 |
| no2 | Alphington | 1340 | 93 |
| no2 | Altona North | 1351 | 94 |
| o3 | Alphington | 1343 | 93 |
| o3 | Cannon Hill | 719 | 50 |
| o3 | Rocklea | 719 | 50 |
| pm10 | Subiaco | 1392 | 97 |
| pm10 | Alphington | 1337 | 93 |
| pm10 | Anzac Memorial | 1398 | 97 |
| pm10 | Brisbane CBD | 719 | 50 |
| pm10 | Brooklyn | 1417 | 98 |
| pm10 | CBD West | 1398 | 97 |
| pm10 | Cannon Hill | 719 | 50 |
| pm10 | Footscray | 312 | 22 |
| pm10 | Kingsville | 1316 | 91 |
| pm10 | Luna Lewisham | 1393 | 97 |
| pm10 | Rocklea | 719 | 50 |
| pm10 | South Brisbane | 719 | 50 |
| pm10 | Sydney, Australia | 1398 | 97 |
| pm10 | Woolloongabba | 719 | 50 |
| pm25 | Subiaco | 1392 | 97 |
| pm25 | Alphington | 1411 | 98 |
| pm25 | Altona North | 1403 | 97 |
| pm25 | Anzac Memorial | 1398 | 97 |
| pm25 | Brisbane CBD | 719 | 50 |
| pm25 | Brooklyn | 1357 | 94 |
| pm25 | CBD West | 1398 | 97 |
| pm25 | Cannon Hill | 719 | 50 |
| pm25 | Footscray | 1208 | 84 |
| pm25 | Kingsville | 1307 | 91 |
| pm25 | Luna Lewisham | 1393 | 97 |
| pm25 | Melbourne CBD | 1414 | 98 |
| pm25 | Rocklea | 719 | 50 |
| pm25 | South Brisbane | 719 | 50 |
| pm25 | Spotswood | 1387 | 96 |
| pm25 | Sydney, Australia | 1398 | 97 |
| pm25 | Woolloongabba | 719 | 50 |
| so2 | Alphington | 1341 | 93 |
| so2 | Altona North | 1351 | 94 |
From the table above, we can infer that most parameters do not have a complete set of hourly timestamps. PM10 and PM2.5 are recorded at the most locations, whereas the other pollutants are recorded at only a few locations, which reduces completeness and could introduce bias when comparing pollutant levels across different areas.
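The record counts and proportions in the table can be computed along the following lines. This is a minimal sketch on toy data: the column names timestamp_count and records_proportion match the table, but the readings are made up, and expected_hours is set to 4 here rather than the 1440 used in the report.

```r
# Toy completed data: NA marks an hour with no reading.
obs <- data.frame(
  parameter = c("pm25", "pm25", "pm25", "pm25", "pm10", "pm10", "pm10", "pm10"),
  location = "Rocklea",
  value = c(20.6, NA, 20.7, 20.4, 30.1, NA, NA, 29.8)
)

# Expected hourly records per series: 1440 in the report (60 days x 24 hours),
# 4 in this toy example.
expected_hours <- 4

# Count non-missing readings per parameter and location.
recorded <- obs[!is.na(obs$value), ]
counts <- aggregate(value ~ parameter + location, data = recorded, FUN = length)
names(counts)[names(counts) == "value"] <- "timestamp_count"

# Share of expected records that are present, as a whole-number percentage.
counts$records_proportion <- round(100 * counts$timestamp_count / expected_hours)
```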
Checking the missingness and outliers
We use the vis_miss() function from the visdat package to visualize the missingness of the dataset across variables, after completing the dataset with the missing timestamps. About 4% of the hourly values are missing.

To explore further the coverage of pollutants across cities, we will visualize the observation records on a time series plot to see which parameters are recorded across different cities and time intervals.

SO2, NO2, and CO have less coverage across locations, whereas PM2.5 and PM10 are recorded more consistently at the majority of locations, making them suitable for further analysis without significant imputation or data handling. In addition, Brisbane only has data recorded from 1 August to 1 September, so it needs careful handling in further analysis.
Data distribution:
We use box plots to get a glimpse of the distribution of each parameter across locations.

From the plot above, we can infer several key observations:
- CO, NO2: There are multiple high outliers, which may represent extreme pollution events or possible sensor anomalies.
- O3, PM10, and PM25: We identified extreme negative outliers for these pollutants, which are likely errors in the dataset.
- SO2: Some high outliers were observed, likely due to limited records or inconsistencies in the dataset.
To ensure our analysis focuses on typical air quality levels, we removed outliers, that is, extreme values that could distort the results. The lower and upper bounds were determined using the interquartile range (IQR): any value more than 1.5 times the IQR below the lower quartile, or more than 1.5 times the IQR above the upper quartile, was considered an outlier.
By filtering out these outliers, we can focus on more accurate, typical pollutant readings, improving the quality of our analysis.
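The IQR rule described above can be sketched as follows. The function name remove_iqr_outliers and the sample readings are hypothetical. Whether the report computed the bounds per pollutant, per location, or over the whole dataset is not stated, so this sketch simply filters a single numeric vector.

```r
# Drop values outside [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is the usual rule.
remove_iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  lower <- q[1] - k * iqr
  upper <- q[2] + k * iqr
  # Missing values are dropped along with the outliers.
  keep <- !is.na(x) & x >= lower & x <= upper
  x[keep]
}

# Hypothetical readings with one negative error value and one extreme spike.
readings <- c(-50, 18.5, 19.2, 19.5, 20.0, 20.4, 20.6, 20.7, 95)
cleaned <- remove_iqr_outliers(readings)
```

To apply the rule per pollutant, the function could be called on each group separately, for example with split() or tapply().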
Results
As mentioned above, we focused on two key pollutants, PM10 and PM2.5, from the air quality data. For each city, we calculated the median value of these pollutants based on all the sensors that measured them. This gave us a representative value of air quality in each city. Finally, we organized the data so that each city has its own row, with separate columns showing the median levels of PM10 and PM2.5.
| City | pm10 | pm25 |
|---|---|---|
| Adelaide | 2.6 | 2.3 |
| Brisbane | 16.8 | 7.1 |
| Melbourne | 13.3 | 3.2 |
| Perth | 2.2 | 2.1 |
| Sydney | 3.3 | 3.1 |
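A table of this shape, with one row per city and one column per pollutant, can be built with a grouped median. The sketch below uses base R and made-up readings for two cities, so its values are illustrative and are not the report's data.

```r
# Made-up readings for two cities and two pollutants.
obs <- data.frame(
  City = c("Perth", "Perth", "Perth", "Perth",
           "Sydney", "Sydney", "Sydney", "Sydney"),
  parameter = c("pm10", "pm10", "pm25", "pm25",
                "pm10", "pm10", "pm25", "pm25"),
  value = c(2.0, 2.4, 2.0, 2.2, 3.0, 3.6, 3.0, 3.2)
)

# Median per city and pollutant, returned as a city x pollutant matrix.
med <- tapply(obs$value, list(obs$City, obs$parameter), median, na.rm = TRUE)

# One row per city, with separate pm10 and pm25 columns.
med_wide <- data.frame(City = rownames(med), med, row.names = NULL)
```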

The bar chart above compares the concentrations of two key pollutants, PM10 (in orange) and PM2.5 (in blue), across five Australian cities: Adelaide, Brisbane, Melbourne, Perth, and Sydney. Brisbane shows the highest levels for both PM10 and PM2.5. Melbourne also has a high PM10 median (13.3), although its PM2.5 median (3.2) is close to Sydney's (3.1). In contrast, Adelaide and Perth have the lowest concentrations, while Sydney falls in between.
This data prompts the need to define what we mean by “clean air” for the purpose of this analysis. Below, we discuss how to establish criteria for classifying air quality based on the observed pollutant levels. This will help in determining which cities have cleaner air relative to others based on their levels of PM10 and PM2.5.
Based on the comparison of PM10 and PM2.5 levels across the cities, we decided to define “clean air” using the halfway point between each city's median PM10 and median PM2.5 concentrations, i.e. the average of the two medians. Cities with a lower midpoint are classified as having cleaner air, and cities with a higher midpoint as more polluted.
Using this criterion, we create a new plot to visualize which cities have cleaner air and which do not, based on their pollutant levels relative to this defined midpoint. This approach will allow us to make clearer distinctions between cities regarding their air quality.

| City | PM10 (Median) | PM25 (Median) | Midway Point |
|---|---|---|---|
| Perth | 2.2 | 2.1 | 2.1 |
| Adelaide | 2.6 | 2.3 | 2.5 |
| Sydney | 3.3 | 3.1 | 3.2 |
| Melbourne | 13.3 | 3.2 | 8.3 |
| Brisbane | 16.8 | 7.1 | 11.9 |
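The midway point and the resulting ranking can be reproduced approximately from the rounded medians in the table. This is a sketch under that assumption: the report presumably used unrounded medians, so the last digit may differ slightly from the table (for example, Brisbane gives 11.95 here against 11.9 in the table).

```r
# Rounded city medians copied from the results table.
city_medians <- data.frame(
  City = c("Adelaide", "Brisbane", "Melbourne", "Perth", "Sydney"),
  pm10 = c(2.6, 16.8, 13.3, 2.2, 3.3),
  pm25 = c(2.3, 7.1, 3.2, 2.1, 3.1)
)

# Midway point: the average of the two medians for each city.
city_medians$midpoint <- (city_medians$pm10 + city_medians$pm25) / 2

# Rank cities from cleanest (lowest midpoint) to most polluted.
ranked <- city_medians[order(city_medians$midpoint), ]
```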
Based on the analysis, we can conclude that Perth has the cleanest air using the halfway point between the median PM10 and median PM2.5 values, narrowly surpassing Adelaide. However, it is important to note that both Adelaide and Perth have only one sensor each, which may introduce bias. While this is the best data available, future studies would benefit from each city having a similar number of sensors to ensure more accurate comparisons.
References
Wickham, H. (2019). *tidyverse: Easily install and load the 'tidyverse'*. R package version 1.3.0. https://CRAN.R-project.org/package=tidyverse
Wickham, H., & Bryan, J. (2019). *readxl: Read excel files*. R package version 1.3.1. https://CRAN.R-project.org/package=readxl
Tierney, N., & Cook, D. (2022). *visdat: Visualising whole data frames*. R package version 0.5.3. https://CRAN.R-project.org/package=visdat
Tierney, N., & Cook, D. (2022). *naniar: Data structures, summaries, and visualisations for missing data*. R package version 0.6.1. https://CRAN.R-project.org/package=naniar
Moritz, S. (2022). *imputeTS: Time series missing value imputation*. R package version 3.2. https://CRAN.R-project.org/package=imputeTS
Wickham, H. (2019). *rvest: Easily harvest (scrape) web data*. R package version 0.3.5. https://CRAN.R-project.org/package=rvest
Wickham, H. (2022). *conflicted: An alternative conflict resolution strategy*. R package version 1.0.4. https://CRAN.R-project.org/package=conflicted
Numbats. (n.d.). *airpurifyr: Air pollution modeling for Australia*. Retrieved from https://numbats.github.io/airpurifyr/
Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis*. Springer-Verlag New York. https://ggplot2.tidyverse.org
Kahle, D., & Wickham, H. (2013). *ggmap: Spatial visualization with ggplot2*. The R Journal, 5(1), 144-161. https://journal.r-project.org/archive/2013-1/kahle-wickham.pdf
Sievert, C. (2020). *Interactive web-based data visualization with R, plotly, and shiny*. Chapman and Hall/CRC. https://plotly-r.com
Hijmans, R. J. (2019). *geosphere: Spherical trigonometry*. R package version 1.5-10. https://CRAN.R-project.org/package=geosphere
Zhu, H. (2021). *kableExtra: Construct complex table with 'kable' and pipe syntax*. R package version 1.3.4. https://CRAN.R-project.org/package=kableExtra
OpenAI. (2023). *ChatGPT (October 2023 version) [Large language model]*. https://chat.openai.com